The dynamics of housing prices are notoriously hard to predict: a complex interplay of factors determines the final market price of an individual property and shapes the housing market at large. For Chicago, the Institute for Housing Studies at DePaul University has identified three main challenges currently facing the city's housing market: loss of affordable rental supply, rising costs for new and existing homeowners, and increasing housing needs of older adults. This report therefore pursues two main objectives in support of Chicago policy makers: determine the main factors contributing to changes in housing prices, and segment the Chicago housing market.
We use two data sources; the metadata describing all relevant variables is included in the Appendix.
Source: Institute for Housing Studies at DePaul University.
Description: This dataset contains information on housing prices across all community areas in the city of Chicago between Q1 1997 and Q2 2023, offering a comprehensive view of housing market trends. The dataset contains four sheets of information:
Variables: median housing prices per community area for each quarter of every year (2017 to 2022). These variables are estimated using other variables in the dataset.
Usage: The housing price data will serve as the dependent variable in our analysis, against which the impact of various independent variables will be measured.
Reference: Institute for Housing Studies - DePaul University. (n.d.). Cook County House Price Index. https://price-index.housingstudies.org/
Source: Chicago Metropolitan Agency for Planning (CMAP).
Description: This dataset encompasses a range of economic and demographic variables from various Chicago community areas, offering a detailed view of local conditions and trends spanning six years (2017-2022).
Variables: there are six main categories of variables: Population and Households, Housing Characteristics, Transportation, Employment, Land Use, and Change Over Time. Each category contains several groups of variables, and each group holds from one to a dozen variables. For example, the Population and Households category contains the General Population Characteristics group, which holds the Total Population variable, the Total Households variable, etc. We recommend accessing a community area's snapshot directly for detailed descriptions of the variables.
Usage: These variables will be used to understand the micro-level economic and demographic factors affecting housing prices in specific Chicago communities.
Reference: Community Data Snapshots Raw Data (2014-2022). (n.d.). https://datahub.cmap.illinois.gov/maps/2a0b0316dc2c4ecfa40a171c635503f8/about
We believe this report involves three main ethical considerations: Data Ownership and Privacy, Data Ownership and Usage, and Community and Individual Welfare. In addition, we identify four main groups of stakeholders: Chicago Community Area (CA) residents, Chicago policy makers and urban planners, the academic and research community, and the general public.
Our analysis is conducted in RStudio, leveraging these packages: tidyverse, caret, factoextra, leaps, glmnet, randomForest, partykit, openxlsx, janitor, tibble, magrittr, purrr, plotly, corrplot, rgl, and scales.
Independent variables: We first import our independent-variable data from CMAP's Community Data Snapshots, which contain a range of economic and demographic variables for the Chicago CAs between 2017 and 2022. We download these files to our local devices and then read them in R:
# Read the data (filter out columns with na values) and merge them together
file1 <- read_csv("dataset/2017.csv") %>% select(where(~ all(!is.na(.))))
file2 <- read_csv("dataset/2018.csv") %>% select(where(~ all(!is.na(.))))
file3 <- read_csv("dataset/2019.csv") %>% select(where(~ all(!is.na(.))))
file4 <- read_csv("dataset/2020.csv") %>% select(where(~ all(!is.na(.))))
file5 <- read_csv("dataset/2021.csv") %>% select(where(~ all(!is.na(.))))
file6 <- read_csv("dataset/2022.csv") %>% select(where(~ all(!is.na(.))))
Dependent variables: We then import our dependent-variable dataset from the Institute for Housing Studies at DePaul University, which contains housing prices across all sub-markets in Chicago (each sub-market fully contains some CAs) between Q1 1997 and Q2 2023, offering a comprehensive view of housing market trends. We download the dataset to our local devices and then read it in R:
# access original data
path <- "Dataset/2023_q2_price_index_data.xlsx"
sheets <- openxlsx::getSheetNames(path)
housing_dataset <- lapply(sheets, openxlsx::read.xlsx, xlsxFile=path)
Since the dependent dataset comes as an Excel file with multiple sheets, we create one dataframe for each sheet.
# assigning names to original data sheets
names(housing_dataset) <- sheets
## Summary sheet
summary_info <- subset(housing_dataset$Summary, select = c(-1)) %>%
# set CA as row names
column_to_rownames(var = names(.)[1]) %>%
# remove aggregate data for the cook county, the city, and the suburb
slice(1:(n() - 3))
## Quarterly index result sheet
index <- housing_dataset$Quarterly_Index_Results %>%
# set column names
row_to_names(row_number = 1) %>%
# set row names
remove_rownames() %>%
column_to_rownames(var = names(.)[1]) %>%
# convert to numeric data types
mutate(across(where(is.character), ~ as.numeric(.))) %>%
# Convert from range 0 to 100 into 0 to 1.
mutate(across(everything(), ~ .x / 100)) %>%
# remove aggregate data for the cook county, the city, and the suburb
subset(select = -c(1:3))
## Submarket_Definitions sheet
Submarket_Definition <- subset(housing_dataset$Submarket_Definitions, select = c(2:3)) %>%
# remove NA values
na.omit() %>%
# set CA as row names
column_to_rownames(var = names(.)[1])
## Submarket_Annual_Sample_Size sheet
Submarket_Annual_Sample_Size <- housing_dataset$Submarket_Annual_Sample_Size %>%
# remove aggregate data for the whole city
select(-ncol(.)) %>%
slice(1:(n() - 1)) %>%
# use CA as column names
row_to_names(row_number = 1) %>%
# set row names
remove_rownames() %>%
{
rownames(.) <- .[[1]]
.[-1]
}
Independent variables: We keep only the variables that exist in all years and merge the yearly independent-variable dataframes. Thus we have:
# Identify common columns for merging
common_cols <- Reduce(intersect, list(names(file1), names(file2), names(file3), names(file4), names(file5), names(file6)))
# Subset Data Frames to Common Columns & Arrange GEOG as the First Column:
file1_common <- file1[, common_cols] %>% select(GEOG, everything())
file2_common <- file2[, common_cols] %>% select(GEOG, everything())
file3_common <- file3[, common_cols] %>% select(GEOG, everything())
file4_common <- file4[, common_cols] %>% select(GEOG, everything())
file5_common <- file5[, common_cols] %>% select(GEOG, everything())
file6_common <- file6[, common_cols] %>% select(GEOG, everything())
# Merge all datasets together
housing_data <- rbind(file1_common, file2_common, file3_common,
file4_common, file5_common, file6_common)
housing_data$Year <- as.factor(housing_data$Year) # change years to factors
Dependent variables: Within the Cook County House Price Index dataset, we first use the summary information sheet to estimate the median housing price for each sub-market for each quarter from Q1 1997 to Q2 2023.
# estimate the 2000 Q1 price from the change since 2000 and the current median sale price
summary_info$`2000_price` <- summary_info$Change.since.2000 * summary_info$`Median.Sales.Price.2022.Q3.-.2023.Q2`
# calculate the price data for each sub-market for each quarter
submarket_price <- as.matrix(index) %*% diag(summary_info$`2000_price`) %>%
set_colnames(colnames(index)) %>% as.data.frame()
Given the price data for each sub-market from Q1 1997 to Q2 2023, and since each sub-market fully contains some CAs, we determine the median housing price for each CA over that period. For example, if CA A lies fully within sub-market B, the estimated housing price of A at every time stamp is that of B.
# create a dataframe containing all the CAs within each sub-market, with each CA as a datapoint
CA_within <- Submarket_Definition %>%
# create row name index
rownames_to_column(var = "Sub-Market") %>%
# separate out each CA
separate_rows(`Municipalities.or.Chicago.Community.Areas.Entirely.Within*`, sep = ",") %>%
# Group by each row name index
group_by(`Sub-Market`) %>%
# number each CA within its sub-market
mutate(row_id = row_number()) %>%
pivot_wider(names_from = row_id, values_from = `Municipalities.or.Chicago.Community.Areas.Entirely.Within*`, names_prefix = "CA_within_") %>%
# transpose the dataset
t() %>%
# make the colnames
row_to_names(row_number = 1) %>%
# change to dataframe type
as.data.frame()
# create a CA price dataset
# Create an empty list to store datapoint
CA_price <- list()
# Loop through each sub-market price data
for (market in colnames(submarket_price)){
# loop through each CA name
for (CA in CA_within[[market]]){
# skip NA value
if (is.na(CA)){next}
# assign the sub-market price vector to each CA that lies entirely within it
CA_price[[CA]] <- submarket_price[[market]]
}
}
# Update the CA price dataframe column names
CA_price <- data.frame(CA_price)
colnames(CA_price) <- gsub("X\\.", "", colnames(CA_price))
colnames(CA_price) <- gsub("\\.", " ", colnames(CA_price))
# Update the CA price dataframe row names
rownames(CA_price) <- rownames(submarket_price)
# Convert rownames to a column named 'YearQuarter'
CA_price <- CA_price %>%
rownames_to_column(var = "YearQuarter")
In our later analysis, our time unit will be the year, from 2017 to 2022. Hence, for the CA_price dataframe, which has a median house price for each quarter, we take the average over all quarters of each year as the annual datapoint.
# Convert quarter price data into annual price data by taking the average of quarters data points.
price_data <- CA_price %>%
select(-starts_with("Unnamed")) %>%
gather(key = "Location", value = "Price", -YearQuarter) %>%
separate(YearQuarter, into = c("Year", "Quarter"), sep = 4) %>%
mutate(Year = as.numeric(Year)) %>%
filter(Year %in% 2017:2022) %>%
group_by(Year, Location) %>%
summarize(AveragePrice = mean(Price, na.rm = TRUE), .groups = 'drop')
# Only keep GEOG values that are present in both datasets
matching_geog <- intersect(housing_data$GEOG, price_data$Location)
# Filter the price data for matching GEOG values
price_data_filtered <- filter(price_data, Location %in% matching_geog)
price_data_filtered$Year <- as.factor(price_data_filtered$Year)
# Merge the datasets and filter out unnecessary variables
clean_data <- housing_data %>% filter(GEOG %in% matching_geog) %>%
left_join(price_data_filtered, by = c("Year", "GEOG" = "Location"))
Our final clean_data dataset is a comprehensive dataset combining the dependent variable, median house price, with the independent variables describing various CA features.
We remove ‘isolated’ variables, i.e., variables that belong to a group with missing members. For example, the Age Cohorts group contains three variables in our final dataset: A20_34, A35_49, and A50_64. The group is therefore missing variables for the number of residents aged under 20 or over 64, so we remove these variables. In addition, we remove variables whose definitions we could not infer from the original dataset.
# remove isolated variables
clean_data <- clean_data %>% subset(select = -c(A20_34, A35_49, A50_64, POP_16OV, POP_25OV, HS, BACH, HU_SNG_DET, HU_SNG_ATT,HU_2UN, HU_3_4UN, HV_LT_150K, HV_150_300K, HV_300_500K, HV_GT_500K, MED_HV, TOT_EMP_RES))
From the employment category, we are interested only in the average aggregate of the employment figures, rather than the specific employment figures for each top industry or location, due to ethical considerations. As such, we create RES_NAICS_AVG, the average employment of the top five most-employed sectors among CA residents, and WORK_NAICS_AVG, the average employment of the top five most-employed sectors within the CA boundary. We then proceed to remove the component variables.
# Create New variables for average employment of top 5 most employed sectors of CA residents and average employment of top 5 most employed sectors in CA boundary.
# average employment of top 5 most employed sectors of CA residents
clean_data <- clean_data %>%
rowwise() %>%
mutate(RES_NAICS_AVG = sum(RES_NAICS1_COUNT, RES_NAICS2_COUNT, RES_NAICS3_COUNT, RES_NAICS4_COUNT, RES_NAICS5_COUNT, na.rm = T)/5)
# we are interested only in the average employment of the top 5 most employed sectors of residents. Thus we remove all its components.
clean_data <- clean_data %>%
subset(select = -c(RES_NAICS1_COUNT, RES_NAICS2_COUNT, RES_NAICS3_COUNT, RES_NAICS4_COUNT, RES_NAICS5_COUNT))
# average employment of top 5 most employed sectors in CA boundary.
clean_data <- clean_data %>%
rowwise() %>%
mutate(WORK_NAICS_AVG = sum(WORK_NAICS1_COUNT, WORK_NAICS2_COUNT, WORK_NAICS3_COUNT, WORK_NAICS4_COUNT, WORK_NAICS5_COUNT, na.rm = T)/5)
# we are interested in the average value of the employment of top 5 most employed sectors in CA boundary only. Thus we remove all its components.
clean_data <- clean_data %>%
subset(select = -c(WORK_NAICS1_COUNT, WORK_NAICS2_COUNT, WORK_NAICS3_COUNT, WORK_NAICS4_COUNT, WORK_NAICS5_COUNT))
# Remove resident_city_count & work_city_count variables
clean_data <- clean_data %>%
subset(select = -c(RES_CITY1_COUNT, RES_CITY2_COUNT, RES_CITY3_COUNT, RES_CITY4_COUNT, RES_CITY5_COUNT, WORK_CITY1_COUNT, WORK_CITY2_COUNT, WORK_CITY3_COUNT, WORK_CITY4_COUNT, WORK_CITY5_COUNT))
Since our analysis focuses only on numerical variables, we remove all categorical variables except for the geo-code location and the year.
# remove categorical variables (all variables that are non-numeric or non-integer) except for the geo-code location and the year.
clean_data <- clean_data[sapply(clean_data, function(x) is.numeric(x) || is.integer(x)) | colnames(clean_data) %in% c("GEOG", "Year")]
numerical_data <- clean_data %>% # Take only numerical value
select(-c(GEOG,Year))
head(clean_data)
## # A tibble: 6 × 64
## # Rowwise:
## GEOG TOT_POP MED_AGE WHITE HISP BLACK ASIAN OTHER POP_HH IN_LBFRC EMP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Albany Pa… 52079 33.8 15349 25003 2215 8223 1289 50477 29624 26868
## 2 Archer He… 13266 32.0 2572 10109 114 388 82 13256 6594 5808
## 3 Armour Sq… 14068 42.0 1545 438 1348 10324 413 14058 6997 5919
## 4 Ashburn 42752 36.2 5555 15455 20817 284 641 42132 21985 19004
## 5 Auburn Gr… 45842 39.1 231 689 44402 117 403 45727 20187 14910
## 6 Austin 97643 34.7 4353 11132 81066 467 625 96495 41638 33519
## # ℹ 53 more variables: UNEMP <dbl>, NOT_IN_LBFRC <dbl>, WORK_AT_HOME <dbl>,
## # TOT_COMM <dbl>, DROVE_AL <dbl>, CARPOOL <dbl>, TRANSIT <dbl>,
## # WALK_BIKE <dbl>, COMM_OTHER <dbl>, INC_LT_25K <dbl>, INC_25_50K <dbl>,
## # INC_50_75K <dbl>, INC_75_100K <dbl>, INC_100_150K <dbl>, INC_GT_150 <dbl>,
## # MEDINC <dbl>, TOT_HH <dbl>, OWN_OCC_HU <dbl>, RENT_OCC_HU <dbl>,
## # VAC_HU <dbl>, HU_TOT <dbl>, MED_ROOMS <dbl>, HA_AFT2000 <dbl>,
## # HA_70_00 <dbl>, HA_40_70 <dbl>, HA_BEF1940 <dbl>, BR_0_1 <dbl>, …
We first examine the distribution of housing prices in the Chicago market. More specifically, we visualize the distribution of housing prices in Chicago between 2017 and 2022.
# Define a list of years
unique_years <- unique(clean_data$Year)
# Define a color palette
colors <- RColorBrewer::brewer.pal(length(unique_years), "Set2")
# Initialize the plot
prices <- plot_ly()
# Determine an appropriate bin width
# This is something you might want to tweak to get the look you want
bin_width <- 200000
# Add a trace for each year when changing the slider
for (i in seq_along(unique_years)) {
year <- unique_years[i]
prices <- prices %>% add_trace(
data = subset(clean_data, Year == year),
x = ~AveragePrice,
type = 'histogram',
visible = i == 1, # Only the first trace is visible initially
name = as.character(year),
marker = list(color = colors[i]),
xbins = list(start = min(clean_data$AveragePrice), end = max(clean_data$AveragePrice), size = bin_width) # Setting bin width here
)
}
# Define the steps for the slider
steps <- lapply(unique_years, function(year, index) {
list(
method = "restyle",
args = list("visible", lapply(unique_years, function(y) y == year)),
label = as.character(year)
)
}, index = seq_along(unique_years))
# Add the slider and update layout
prices <- prices %>% layout(
sliders = list(
list(
active = 0,
currentvalue = list(prefix = "Year: "),
steps = steps
)
),
title = "Distribution of Housing Prices Over Years",
xaxis = list(
title = "Average Price",
# Set the range if you want to focus on a specific portion of the data
range = c(min(clean_data$AveragePrice), max(clean_data$AveragePrice))
),
yaxis = list(title = "Frequency"),
hovermode = 'closest',
font = list(size = 12),
bargap = 0.1 # Adjust the gap between bars if needed
)
# Show the plot
prices
From the graph we see that the distribution of housing prices in Chicago is right-skewed in every year of the study period. The median and average housing prices are $710,849 and $1,027,310 in 2017, and $1,098,684 and $1,520,350 in 2022, respectively. As such, we infer that housing prices in Chicago grew over the 2017-2022 period. The increase is not uniform throughout the city: the median price rose by roughly 55%, while the average price rose by roughly 48%. Hence, we infer that housing units priced below the median have seen a larger relative increase in price than higher-priced units.
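As a quick check, the growth rates implied by the 2017 and 2022 summary statistics can be computed directly (a minimal sketch; the dollar figures are taken from the text above):

```r
# Growth in median and average housing prices, 2017 -> 2022,
# computed from the quoted summary statistics.
median_2017 <- 710849;  median_2022 <- 1098684
mean_2017   <- 1027310; mean_2022   <- 1520350

median_growth <- (median_2022 - median_2017) / median_2017 * 100
mean_growth   <- (mean_2022   - mean_2017)   / mean_2017   * 100

round(c(median = median_growth, mean = mean_growth), 1)
```

The median growing faster than the mean is what motivates the inference that lower-priced units appreciated more quickly.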
In determining the impact of different CA features on housing prices, we cannot ignore the impact of these features on each other. We therefore build a correlation matrix including all variables that have an absolute correlation greater than 0.75 with at least one other variable. Thus, we have:
# Calculate the correlation matrix
corr_matrix <- cor(numerical_data)
# Set a high threshold for correlation
threshold <- 0.75
corr_matrix[abs(corr_matrix) < threshold] <- NA # Set correlations below threshold to NA
# Function to check if all non-NA values are 1
all_one_or_na <- function(x) all(x[!is.na(x)] == 1)
# Remove columns that contain only 1 and NA
corr_matrix <- corr_matrix[, !apply(corr_matrix, 2, all_one_or_na)]
# Remove rows that contain only 1 and NA
corr_matrix <- corr_matrix[!apply(corr_matrix, 1, all_one_or_na), ]
# Identify variables with at least one significant correlation
significant_vars <- apply(!is.na(corr_matrix), 1, any)
# Filter out variables without significant correlations
corr_matrix_filtered <- corr_matrix[significant_vars, significant_vars]
diag(corr_matrix_filtered) <- NA # Exclude self-correlations
# Create an interactive heatmap with the filtered matrix
fig_corr <- plot_ly(x = colnames(corr_matrix_filtered), y = colnames(corr_matrix_filtered), z = corr_matrix_filtered,
type = "heatmap", colorscale = "Plasma",
zmin = threshold, zmax = 1) %>%
layout(title = "Filtered Correlation Matrix (Absolute Corr >= 0.75)",
xaxis = list(title = "Variables", tickangle = 45, tickfont = list(size = 10)),
yaxis = list(title = "Variables", tickangle = -45, tickfont = list(size = 10)),
margin = list(l = 100, r = 100, t = 100, b = 100))
fig_corr
Our independent dataset has 62 variables, of which 48 have a strong absolute correlation with at least one other variable. Since the majority of variables (around 77.42%) are strongly correlated with another variable, CA features appear to strongly influence each other. Given the number of strong correlations between independent variables, we infer evidence of multicollinearity.
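The count of strongly correlated variables can also be computed directly rather than read off the heatmap. A minimal sketch, using a hypothetical helper count_strong and a small toy data frame standing in for numerical_data:

```r
# Count how many variables have at least one absolute correlation
# above a threshold with another variable (self-correlations excluded).
count_strong <- function(df, threshold = 0.75) {
  cm <- cor(df)
  diag(cm) <- 0                     # ignore self-correlations
  sum(apply(abs(cm) > threshold, 1, any))
}

# Toy example: x and y are perfectly correlated, z is weakly related.
toy <- data.frame(x = 1:10, y = (1:10) * 2, z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
count_strong(toy)
```

Applied to numerical_data, dividing the count by ncol(numerical_data) gives the share of strongly correlated variables.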
We therefore perform variable selection to reduce multicollinearity. More specifically, we perform hierarchical clustering to determine clusters of highly correlated variables and remove all but one variable from each cluster. This guarantees that we do not eliminate variables that are highly correlated but belong to different clusters, thus decreasing multicollinearity without increasing bias.
# Calculate the initial correlation matrix
corr_matrix <- cor(numerical_data)
# Perform hierarchical clustering
abs_corr_matrix <- abs(corr_matrix)
diag(abs_corr_matrix) <- 0 # Exclude self-correlations
hc <- hclust(as.dist(1 - abs_corr_matrix), method = "average")
# Identify clusters of highly correlated variables (threshold: 0.9)
high_corr_threshold <- 0.9
clusters <- cutree(hc, h = 1 - high_corr_threshold)
# Select one variable from each cluster
vars_to_keep <- tapply(names(clusters), clusters, function(x) x[1])
# Update numerical dataset by keeping only the selected variables
filtered_numerical_data <- numerical_data[, vars_to_keep, drop = FALSE]
Hence, our new filtered correlation matrix is:
# Recalculate the correlation matrix with the filtered data
filtered_corr_matrix <- cor(filtered_numerical_data)
# Set a threshold for displaying significant correlations
threshold <- 0.75
filtered_corr_matrix[abs(filtered_corr_matrix) < threshold] <- NA
diag(filtered_corr_matrix) <- NA # Exclude self-correlations
# Remove columns that contain only 1 and NA
filtered_corr_matrix <- filtered_corr_matrix[, !apply(filtered_corr_matrix, 2, all_one_or_na)]
# Remove rows that contain only 1 and NA
filtered_corr_matrix <- filtered_corr_matrix[!apply(filtered_corr_matrix, 1, all_one_or_na), ]
# Create an interactive heatmap with the filtered correlation matrix
fig_corr <- plot_ly(x = colnames(filtered_corr_matrix), y = colnames(filtered_corr_matrix), z = filtered_corr_matrix,
type = "heatmap", colorscale = "Plasma",
zmin = threshold, zmax = 1) %>%
layout(title = "Filtered Correlation Matrix (Absolute Corr >= 0.75)",
xaxis = list(title = "Variables", tickangle = 45, tickfont = list(size = 10)),
yaxis = list(title = "Variables", tickangle = -45, tickfont = list(size = 10)),
margin = list(l = 100, r = 100, t = 100, b = 100))
fig_corr
After variable selection based on the initial correlation matrix, filtered_numerical_data, which contains all numerical independent variables after removing highly correlated ones, has 44 variables. Of these, 28 have a strong absolute correlation with at least one other variable. Although the majority of variables (around 63.64%) are still strongly correlated with another variable, we believe these correlations do not indicate multicollinearity, because the variables that remain highly correlated did not fall into the same cluster during the hierarchical-clustering variable selection.
We update the clean_data dataframe with the filtered numerical variables in filtered_numerical_data.
# Remove numerical_data subset out of clean_data
columns_to_keep <- setdiff(names(clean_data), names(numerical_data)) # Identify columns in clean_data that are not in numerical_data
clean_data <- clean_data[columns_to_keep] # Subset clean_data to keep only the columns that are not in numerical_data
# Add filted_numerical_data into clean_data
clean_data <- cbind(clean_data, filtered_numerical_data)
We will proceed with the following statistical analyses and model building:
Lasso Regression: Since the main dataset, clean_data, still contains 44 numerical variables, we perform Lasso Regression. The method performs variable selection and regularization by imposing a penalty on the absolute size of the coefficients, which can help further identify the most significant predictors for our dataset.
Principal Component Analysis (PCA) & K-Means Clustering: To segment the housing market in Chicago, we perform PCA and then apply K-Means Clustering to the PCA results. PCA reduces the dimensionality of the data while retaining the majority of the information; the reduction is achieved by identifying the principal components (PCs), uncorrelated linear combinations of the original variables that capture maximum variance. After determining the PCs, we apply K-Means Clustering to segment the dataset into distinct groups. K-Means allows us to determine different segments of the Chicago Community Areas housing market and reveals intrinsic groupings based on underlying structure.
Decision Tree Regression: To determine the main factors that change Chicago housing prices, we apply Decision Tree Regression. Decision trees can be visualized and are relatively easy to understand, as the method provides clear decision rules for how the inputs affect the target variable. It can also handle quantitative and qualitative predictors as well as missing data.
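As a preview of the Decision Tree Regression step, a minimal sketch; we use rpart here for self-containment (partykit from our package list would work similarly), with the built-in mtcars data standing in for clean_data and mpg standing in for AveragePrice:

```r
library(rpart)

# Fit a regression tree ("anova" method) on a built-in dataset;
# the response and predictors are illustrative placeholders only.
tree_fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# The printed object lists the decision rule at each split
print(tree_fit)

# Predictions for the training data
preds <- predict(tree_fit, mtcars)
```

In the report itself, the same call would take clean_data and AveragePrice as response.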
We perform Lasso Regression to continue variable selection and increase accuracy.
# Data Preparation for Lasso Regression
# 'AveragePrice' is the dependent variable
target <- filtered_numerical_data$AveragePrice
features <- filtered_numerical_data %>% select_if(is.numeric) %>% select(-AveragePrice)
# Data preparation into x and y
x <- as.matrix(features)
y <- target
set.seed(123)
# Fit the Lasso path over a sequence of lambda values
lasso_model <- glmnet(x, y, alpha=1)
plot(lasso_model)
The resulting Lasso coefficient-path plot gives a visual representation of the model's coefficients as a function of the L1 norm, reflecting the cumulative magnitude of the coefficients under the Lasso penalty λ. Moving along the path, coefficients either stay at 0 or move from 0 to non-zero values as the penalty weakens. We continue by identifying the optimal λ value that minimizes the cross-validated Mean Squared Error (MSE).
# Find the best lambda via cross-validation
cv_model <- cv.glmnet(x, y, alpha=1)
best_lambda <- cv_model$lambda.min # lambda minimizing cross-validated MSE
lambda_1se <- cv_model$lambda.1se # largest lambda within 1 SE of the minimum
# Fit the LASSO model using the best lambda value
lasso_best <- glmnet(x, y, alpha = 1, lambda = best_lambda)
print(cv_model)
##
## Call: cv.glmnet(x = x, y = y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 91 99 2.983e+11 3.000e+10 43
## 1se 7209 52 3.256e+11 3.909e+10 34
print(lasso_best)
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = best_lambda)
##
## Df %Dev Lambda
## 1 43 83.44 90.97
Our results indicate an optimal λ of about 91, with cross-validation reporting a minimum MSE of about 2.983e+11 (the more conservative 1-SE rule would select λ ≈ 7209). Overall, the model explains 83.44% of the deviance in the dependent variable. Since Lasso Regression performs variable selection by pushing some coefficients to 0, we list the non-zero coefficients:
# Extract the coefficients
coefficients <- coef(lasso_best, s = best_lambda)
# Coefficients are returned in a sparse matrix; convert to regular matrix for easier handling
coefficients <- as.matrix(coefficients)
# Identify non-zero coefficients (i.e., variables retained in the model)
non_zero_coefficients <- coefficients[coefficients != 0, , drop = FALSE]
# Convert into matrix form for significant variables
non_zero_coefficients <- as.matrix(non_zero_coefficients)
# Extract the names of the variables with non-zero coefficients (excluding intercept)
selected_features <- setdiff(rownames(non_zero_coefficients), "(Intercept)")
# Subset the original dataset to include only final features for PCA
final_filtered_data <- numerical_data[selected_features]
# show some non-zero coefficients
print(non_zero_coefficients, max = 6)
## s1
## (Intercept) 5.971136e+06
## TOT_POP -1.763783e+02
## MED_AGE -2.875071e+04
## WHITE -2.480088e+01
## HISP 8.058051e+00
## BLACK -1.338189e-01
## [ reached getOption("max.print") -- omitted 38 rows ]
From the Lasso results, we identify 44 non-zero coefficients: the intercept plus all 43 predictors. In this case Lasso therefore retains every numerical variable and performs no further reduction. We nonetheless update our main dataset, clean_data, for consistency.
# Remove numerical_data subset out of clean_data
columns_to_keep <- setdiff(names(clean_data), names(final_filtered_data)) # Identify columns in clean_data that are not in final_filtered_data
clean_data <- clean_data[columns_to_keep] # Subset clean_data to keep only the columns that are not in final_filtered_data
# Add final_filtered_data into clean_data
clean_data <- cbind(clean_data, final_filtered_data)
In order to determine different housing market segment in Chicago, we would implement PCA and K-Means Clustering. We first use PCA for our data.
final_filtered_data$AveragePrice <- clean_data$AveragePrice
# Perform PCA on the dataset with selected features
pca_result <- prcomp(final_filtered_data, scale. = TRUE, center=TRUE)
# Explained Variance
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.9572 2.4972 2.1831 1.71909 1.63380 1.47221 1.35761
## Proportion of Variance 0.3559 0.1417 0.1083 0.06717 0.06067 0.04926 0.04189
## Cumulative Proportion 0.3559 0.4976 0.6059 0.67310 0.73377 0.78303 0.82492
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.10429 1.00627 0.8520 0.8125 0.76706 0.73417 0.64069
## Proportion of Variance 0.02772 0.02301 0.0165 0.0150 0.01337 0.01225 0.00933
## Cumulative Proportion 0.85263 0.87565 0.8921 0.9072 0.92052 0.93277 0.94210
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.61365 0.51218 0.45090 0.44755 0.42444 0.40466 0.39469
## Proportion of Variance 0.00856 0.00596 0.00462 0.00455 0.00409 0.00372 0.00354
## Cumulative Proportion 0.95066 0.95662 0.96124 0.96579 0.96989 0.97361 0.97715
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.3571 0.33735 0.32665 0.29283 0.27840 0.26947 0.24765
## Proportion of Variance 0.0029 0.00259 0.00242 0.00195 0.00176 0.00165 0.00139
## Cumulative Proportion 0.9800 0.98263 0.98506 0.98701 0.98877 0.99042 0.99181
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.24359 0.2295 0.20856 0.19623 0.18043 0.1629 0.15127
## Proportion of Variance 0.00135 0.0012 0.00099 0.00088 0.00074 0.0006 0.00052
## Cumulative Proportion 0.99316 0.9944 0.99535 0.99622 0.99696 0.9976 0.99809
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.14536 0.12757 0.12282 0.11391 0.10090 0.07629 0.04535
## Proportion of Variance 0.00048 0.00037 0.00034 0.00029 0.00023 0.00013 0.00005
## Cumulative Proportion 0.99857 0.99894 0.99928 0.99957 0.99981 0.99994 0.99998
## PC43 PC44
## Standard deviation 0.02636 6.831e-06
## Proportion of Variance 0.00002 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00
# Scree plot to visualize variance explained by each principal component
fviz_eig(pca_result)
From the PCA results, we see that roughly 89% of the variance is explained by the first 10 PCs. As such, we believe implementing K-Means Clustering on these 10 PCs maintains accuracy while increasing the interpretability of the clusters. Thus, we implement K-Means Clustering with only these 10 PCs.
We implement K-Means Clustering for two purposes: to determine the optimal number of housing-market segments and to obtain the characteristics of those segments.
To determine the optimal number of clusters, we utilize the Elbow method, which evaluates how the total within-cluster sum of squares falls as the number of clusters grows. We note that identifying an appropriate number of clusters is crucial, as it influences the accuracy of the market segmentation and the interpretability of the results.
# Extract the scores for the first 10 principal components according to the Scree plot
pca_scores <- pca_result$x[, 1:10]
# Determine the optimal number of clusters (using the elbow method)
set.seed(123)
wss <- sapply(1:15, function(k){kmeans(pca_scores, k, nstart = 10)$tot.withinss})
plot(1:15, wss, type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
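The elbow reading can be cross-checked with the average silhouette width, which rewards clusters that are both compact and well separated. A sketch on synthetic three-group data, assuming the `cluster` package (shipped with base R) is available; on our data, `pca_scores` would take the place of `pts`:

```r
library(cluster)  # silhouette()

set.seed(123)
# Three well-separated synthetic groups in 2-D
pts <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
d <- dist(pts)

# Average silhouette width for k = 2..6; higher is better
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(pts, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]
best_k
```

When the two diagnostics agree, as they do here at k = 3, the choice of cluster count is on firmer ground than either criterion alone.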
# Perform K-means clustering with the chosen number of clusters
# Replace 'k' with the chosen number of clusters based on the elbow method
kmeans_result <- kmeans(pca_scores, centers = 3, nstart = 25)
From the plot above, the optimal number of clusters lies between 3 and 6, as the total within-cluster sum of squares flattens and remains nearly stagnant beyond that range. We proceed with 3 clusters: choosing the smallest plausible value minimizes model complexity and increases the chance that the clusters correspond to low, middle, and high tiers, improving the economic interpretability of the housing market. We then assign each observation's cluster label to the final_filtered_data dataset.
# Assign the clusters back into the full original dataset (with year)
final_filtered_data$Clusters <- kmeans_result$cluster # assign the clusters into the full filtered dataset with year and location
final_filtered_data$Clusters <- as.factor(final_filtered_data$Clusters)
final_filtered_data$GEOG <- clean_data$GEOG
final_filtered_data$Year <-clean_data$Year
# Summary of avg housing price for each cluster
summary_table <- final_filtered_data %>% group_by(Clusters) %>%
summarise(
Count = n(),
Mean = mean(AveragePrice, na.rm = TRUE),
SD = sd(AveragePrice, na.rm = TRUE),
Min = min(AveragePrice, na.rm = TRUE),
Median = median(AveragePrice, na.rm = TRUE),
Max = max(AveragePrice, na.rm = TRUE)
)
print(summary_table)
## # A tibble: 3 × 7
## Clusters Count Mean SD Min Median Max
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 163 859093. 642285. 166651. 754552. 3611389.
## 2 2 55 2997324. 1376572. 1573985. 2249095. 5809600.
## 3 3 208 1127029. 1121686. 166651. 753148. 5809600.
From the table, we can reasonably divide the Chicago housing market into three categories: low, middle, and high segments. This follows from the drastic differences in average housing price between segments, with the low, high, and middle segments averaging $859,093, $2,997,324, and $1,127,029 respectively. To probe these differences further, we examine each segment's total population characteristics: higher-population areas tend to face higher housing demand, which should drive prices up. We therefore decide to highlight only the CAs above the 80th percentile or below the 20th percentile in total population.
# Define a color palette for the clusters
num_clusters <- length(unique(final_filtered_data$Clusters))
cluster_colors <- RColorBrewer::brewer.pal(num_clusters, "Set1")
# Get the unique years in the dataset
unique_years <- sort(unique(final_filtered_data$Year))
# Initialize the plot
scatter_plot <- plot_ly()
# Add a trace for each cluster (not each year and cluster combination)
for (j in seq_along(unique(final_filtered_data$Clusters))) {
# Initialize with the first year's data for the cluster
cluster_data <- filter(final_filtered_data, Year == unique_years[1], Clusters == as.factor(j))
# Determine the quantiles for labeling
top_bottom_quantile <- quantile(final_filtered_data$TOT_POP, probs = c(0.2, 0.8))
final_filtered_data$Label <- ifelse(final_filtered_data$TOT_POP > top_bottom_quantile[2] |
final_filtered_data$TOT_POP < top_bottom_quantile[1],
as.character(final_filtered_data$GEOG),
NA)
scatter_plot <- scatter_plot %>% add_trace(
data = final_filtered_data,
x = ~TOT_POP,
y = ~AveragePrice,
type = 'scatter',
mode = 'markers+text',
name = paste("Cluster", j),
marker = list(color = cluster_colors[j]),
text = ~Label,
textposition = 'top center',
hoverinfo = 'text+x+y',
showlegend = TRUE
)
}
# Create the steps for the slider
steps <- vector("list", length(unique_years))
for (i in seq_along(unique_years)) {
steps[[i]] <- list(
method = "restyle",
args = list(
list(
"x" = lapply(seq_along(unique(final_filtered_data$Clusters)),
function(j) filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))$TOT_POP),
"y" = lapply(seq_along(unique(final_filtered_data$Clusters)),
function(j) filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))$AveragePrice),
"text" = lapply(seq_along(unique(final_filtered_data$Clusters)),
function(j) {
year_cluster_data <- filter(final_filtered_data, Year == unique_years[i], Clusters == as.factor(j))
top_bottom_quantile <- quantile(year_cluster_data$TOT_POP, probs = c(0.2, 0.8))
ifelse(year_cluster_data$TOT_POP > top_bottom_quantile[2] |
year_cluster_data$TOT_POP < top_bottom_quantile[1],
as.character(year_cluster_data$GEOG),
NA)
})
)
),
label = as.character(unique_years[i])
)
}
# Add the slider and update the layout
scatter_plot <- scatter_plot %>% layout(
sliders = list(
list(
active = 0,
currentvalue = list(prefix = "Year: "),
steps = steps
)
),
title = "Clustering of Average Housing Prices by Population Over Years",
xaxis = list(title = "Total Population"),
yaxis = list(title = "Average Housing Price"),
hovermode = 'closest'
)
# Show the plot
scatter_plot
The plot illustrates that from 2017 to 2022, Cluster 2, which represents the highest average housing prices, also encompasses areas with larger populations, with West Town and Rogers Park being prominent examples. Conversely, Cluster 3, characterized by moderate housing prices, consistently corresponds to regions with smaller populations, notably Washington Heights and West Pullman. Meanwhile, Cluster 1, associated with the lowest average housing prices, exhibits a broad spectrum of population totals, indicating a diverse range of community sizes within this cluster, with locations such as South Lawndale and Portage Park.
Finally, we use random forest regression to determine the most significant variables impacting housing price.
control <- trainControl(method = "cv", number = 10) # train using 10-fold cross validation
# Define the tuning grid
tune_grid <- expand.grid(mtry = seq(2, length(selected_features), by = 5))
# Subset the data
forest_data <- final_filtered_data %>%
select(-c(Label, Year, GEOG))
set.seed(123) # for reproducibility
# Train the model
rf_model_tuned <- train(AveragePrice ~ .,
data = forest_data,
method = "rf",
trControl = control,
tuneGrid = tune_grid)
# Print the model summary
print(rf_model_tuned)
## Random Forest
##
## 426 samples
## 44 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 383, 382, 383, 384, 384, 384, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 252009.8 0.9700304 156531.5
## 7 223046.8 0.9739722 138351.3
## 12 209009.8 0.9765224 130572.0
## 17 200867.1 0.9775351 128056.5
## 22 203387.6 0.9764808 128640.2
## 27 193892.0 0.9780274 124636.8
## 32 193069.9 0.9779350 122834.4
## 37 189733.5 0.9781133 121689.9
## 42 183967.8 0.9794397 120063.8
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 42.
# Plot the model performance
plot(rf_model_tuned)
We use cross-validation to determine the number of randomly selected predictors (mtry) that minimizes RMSE. The results show that with mtry = 42, the RMSE reaches a minimum of approximately $183,968. At this setting the model also achieves a very high R² of 97.94%, i.e. it explains 97.94% of the variability in the response variable. However, since the model considers 42 of the 44 available predictors at each split, we infer a risk of over-fitting. We therefore further investigate the model's performance by looking at out-of-sample error rates. In addition, we determine the ten most significant variables from the model:
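A simple way to act on the out-of-sample concern is a hold-out split: fit the forest on a training portion and compare test RMSE with in-sample RMSE. The sketch below uses synthetic data as a stand-in (our `forest_data` and `AveragePrice` would take the place of `demo` and `y`); a test RMSE far above the training RMSE would signal over-fitting.

```r
library(randomForest)

set.seed(123)
# Synthetic stand-in: 300 rows, a known linear signal plus noise
n <- 300
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- 3 * demo$x1 - 2 * demo$x2 + rnorm(n)

train_idx <- sample(n, size = 0.8 * n)   # 80/20 hold-out split
rf_fit <- randomForest(y ~ ., data = demo[train_idx, ], ntree = 200)

train_rmse <- sqrt(mean((predict(rf_fit, demo[train_idx, ]) - demo$y[train_idx])^2))
test_rmse  <- sqrt(mean((predict(rf_fit, demo[-train_idx, ]) - demo$y[-train_idx])^2))
c(train = train_rmse, test = test_rmse)
```

The in-sample RMSE is optimistic because each tree has seen most of the training rows; the hold-out RMSE is the honest estimate of predictive error.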
set.seed(123) # For reproducibility
optimal_rf_model <- randomForest(AveragePrice ~ ., data = clean_data, mtry = 42, ntree = 200) # fit the full model with optimal mtry
plot(optimal_rf_model) # plot the full model performance
# Get the important values in the RF model (node purity)
importance_values <- importance(optimal_rf_model)
# Create a data frame for plotting
importance_df <- data.frame(Variable = rownames(importance_values),
Importance = importance_values[, "IncNodePurity"])
# Ordering variables by importance
importance_df <- importance_df[order(-importance_df$Importance),]
# Filter the top 10 most significant variables
top_10 <- importance_df %>%
top_n(10, Importance) %>%
arrange(desc(Importance))
# Create the ggplot plot
ggplot(top_10, aes(x = reorder(Variable, Importance), y = Importance, fill = Variable)) +
geom_bar(stat = "identity", color = "black", size = 0.5) +
coord_flip() + # Flip coordinates for horizontal layout
theme_minimal() +
labs(title = "Top 10 Variable Importance in Random Forest Model",
x = "Variable",
y = "Increase in Node Purity") +
theme(axis.text.y = element_text(size = 12), # Adjust text size on the y-axis
axis.title.x = element_blank(), # Remove x-axis label
legend.position = "none") # Remove the legend
Based on the error plot, model performance levels off quickly as the number of trees increases. The results suggest that approximately 50 trees are adequate to achieve a near-optimal error rate for the model with mtry = 42.
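The tree-count observation can be quantified rather than eyeballed: `randomForest` stores the running out-of-bag MSE after each tree in `$mse`. A sketch on synthetic data (for our model, `optimal_rf_model$mse` would be inspected the same way):

```r
library(randomForest)

set.seed(123)
demo <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
demo$y <- 2 * demo$x1 + rnorm(300)

fit <- randomForest(y ~ ., data = demo, ntree = 200)

# fit$mse[i] is the OOB MSE using the first i trees;
# find the first tree count within 5% of the final OOB error
final_mse <- fit$mse[length(fit$mse)]
stable_at <- which(fit$mse <= 1.05 * final_mse)[1]
stable_at
```

Reporting the first tree count within a small tolerance of the final error makes the "about 50 trees suffice" claim reproducible.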
Moreover, the model identifies the ten most important variables in predicting average housing price. We notice that the most important variables relate to the transportation categories: WALK_BIKE, TRANSperc, and TRANSIT. Other variables, such as Mfperc and WORK_AT_HOME, are also significant. We should note that the importance scores are derived from node impurity during tree construction, with higher scores indicating greater improvements in node purity.
The variable importance plot provides insight into the factors that play a crucial role in determining housing prices. The variable ‘WALK_BIKE’ holds high importance, suggesting that amenities promoting walking and biking have a major influence on housing values; this could indicate a rising preference for convenience and well-being among Chicago residents. Additionally, ‘TRANSperc’ and ‘TRANSIT’ are identified as important variables, emphasizing the role of transportation infrastructure in evaluating housing prices. This aligns with the idea that easy access to transport can boost property appeal and subsequently drive up housing prices. Several other top variables relate to racial and land-use demographics, which underscores the importance of these economic characteristics. For instance, the multi-family and single-family land-use shares rank quite high, indicating that the mix of multi-family and single-family homes in an area may reflect broader trends in housing demand and urban development patterns. The variable ‘INC_GT_150’ highlights how economic prosperity affects housing prices, indicating that neighborhoods with more high-income households are associated with expensive real estate. Finally, the significance of the ‘VACANT’ variable underscores how supply impacts pricing dynamics, as areas with more vacant land may experience different market conditions than those with fewer vacancies.
Given these findings from the table and example clustering plot, a recommendation for stakeholders and policymakers would be to focus on differentiated strategies for each cluster. For Cluster 2, enhancing existing infrastructure and local services could further increase property values, maintaining the high housing prices in these community areas. In Clusters 1 and 3 (with noticeably lower average housing prices), there is an opportunity to develop housing policies that support affordability while also considering interventions to stabilize and increase property values, such as community development projects or incentivizing new local businesses to address the diverse needs of these communities.
The table below shows detailed descriptions of the variables:
# Import variables description metadata.
Variables_Description_Metadata <- read_excel("Dataset/Variables Description Metadata.xlsx")
# Create a table with caption for variables description.
table <- kable(Variables_Description_Metadata, format = "html", caption = "Variable Definitions: Author's creation using CMAP dataset")
pretty_table <- table %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F)
# Appendix table
pretty_table
| Variable Name | Description |
|---|---|
| GEOG | Name of the community area (CA). |
| TOT_POP | Total population of the CA. |
| POP_HH | Total number of households in the CA. |
| A20_34 | Counts of residents within the CA that lies within the age of 20 to 34. |
| A35_49 | Counts of residents within the CA that lies within the age of 35 to 49. |
| A50_64 | Counts of residents within the CA that lies within the age of 50 to 64. |
| MED_AGE | The median age of the residents in CA. |
| WHITE | The number of residents in CA that have racial category as White, Non-Hispanic |
| HISP | The number of residents in CA that have racial category as Hispanic or Latino of any race. |
| BLACK | The number of residents in CA that have racial category as Black, Non-Hispanic |
| ASIAN | The number of residents in CA that have racial category as Asian, Non-Hispanic |
| OTHER | The number of residents in CA that have racial category as Other/Multiple Races, Non-Hispanic |
| POP_16OV | No definition found; variable removed during analysis |
| IN_LBFRC | Counts of residents within the CA that are categorized as in the labor force |
| EMP | Counts of residents within the CA that are categorized as in the labor force and employed |
| UNEMP | Counts of residents within the CA that are categorized as in the labor force and unemployed |
| NOT_IN_LBFRC | Counts of residents within the CA that are categorized as not in the labor force |
| WORK_AT_HOME | Counts of residents within the CA that work at home rather than travel to work |
| TOT_COMM | Total number of residents within the CA that commute to work |
| DROVE_AL | Counts of residents within the CA that drive alone to work |
| CARPOOL | Counts of residents within the CA that carpool to work |
| TRANSIT | Counts of residents within the CA that use public transportation to work |
| WALK_BIKE | Counts of residents within the CA that use a bike or walk to work |
| COMM_OTHER | Counts of residents within the CA that use other mode of travel to work |
| POP_25OV | No definition found; variable removed during analysis |
| HS | Counts of residents within the CA that have a high school diploma or equivalent |
| BACH | Counts of residents within the CA that have a bachelor’s degree |
| INC_LT_25K | Counts of households within the CA that have an income less than $25,000 |
| INC_25_50K | Counts of households within the CA that have an income between $25,000 and $50,000 |
| INC_50_75K | Counts of households within the CA that have an income between $50,000 and $75,000 |
| INC_75_100K | Counts of households within the CA that have an income between $75,000 and $100,000 |
| INC_100_150K | Counts of households within the CA that have an income between $100,000 and $150,000 |
| INC_GT_150 | Counts of households within the CA that have an income more than $150,000 |
| MEDINC | The median household income within the CA |
| TOT_HH | Total number of Occupied Housing Units |
| OWN_OCC_HU | Total number of Owner Occupied Housing Units |
| RENT_OCC_HU | Total number of Renter Occupied Units |
| VAC_HU | Total number of Vacant Housing Units |
| HU_TOT | Total number of Housing Units |
| HU_SNG_DET | Counts of single-family detached housing units |
| HU_SNG_ATT | Counts of single-family attached housing units |
| HU_2UN | Counts of housing units in 2-unit structures |
| HU_3_4UN | Counts of housing units in 3- to 4-unit structures |
| HA_AFT2000 | Counts of housing units that are built after 2000 |
| HA_70_00 | Counts of housing units that are built between 1970 and 2000 |
| HA_40_70 | Counts of housing units that are built between 1940 and 1970 |
| HA_BEF1940 | Counts of housing units that are built before 1940 |
| MED_HA | Median year built of housing units within the CA |
| BR_0_1 | Counts of housing units that have 0 to 1 bedrooms |
| BR_2 | Counts of housing units that have 2 bedrooms |
| BR_3 | Counts of housing units that have 3 bedrooms |
| BR_4 | Counts of housing units that have 4 bedrooms |
| BR_5 | Counts of housing units that have 5 bedrooms |
| MED_ROOMS | Median number of rooms within a housing unit in the CA |
| HV_LT_150K | No definition found; variable removed during analysis |
| HV_150_300K | No definition found; variable removed during analysis |
| HV_300_500K | No definition found; variable removed during analysis |
| HV_GT_500K | No definition found; variable removed during analysis |
| MED_HV | No definition found; variable removed during analysis |
| TOT_EMP_RES | No definition found; variable removed during analysis |
| RES_NAICS1_TYPE | The largest employer sector of CA residents |
| RES_NAICS2_TYPE | The 2nd largest employer sector of CA residents |
| RES_NAICS3_TYPE | The 3rd largest employer sector of CA residents |
| RES_NAICS4_TYPE | The 4th largest employer sector of CA residents |
| RES_NAICS5_TYPE | The 5th largest employer sector of CA residents |
| RES_NAICS1_COUNT | Number of employment of the largest employer sector of CA residents |
| RES_NAICS2_COUNT | Number of employment of the 2nd largest employer sector of CA residents |
| RES_NAICS3_COUNT | Number of employment of the 3rd largest employer sector of CA residents |
| RES_NAICS4_COUNT | Number of employment of the 4th largest employer sector of CA residents |
| RES_NAICS5_COUNT | Number of employment of the 5th largest employer sector of CA residents |
| TOT_EMP_WORK | Total number of employment of top 5 industry sector with largest employed population from CA |
| WORK_NAICS1_TYPE | The largest employer sector in CA boundary |
| WORK_NAICS2_TYPE | The 2nd largest employer sector in CA boundary |
| WORK_NAICS3_TYPE | The 3rd largest employer sector in CA boundary |
| WORK_NAICS4_TYPE | The 4th largest employer sector in CA boundary |
| WORK_NAICS5_TYPE | The 5th largest employer sector in CA boundary |
| WORK_NAICS1_COUNT | Number of employment of the largest employer sector in CA boundary |
| WORK_NAICS2_COUNT | Number of employment of the 2nd largest employer sector in CA boundary |
| WORK_NAICS3_COUNT | Number of employment of the 3rd largest employer sector in CA boundary |
| WORK_NAICS4_COUNT | Number of employment of the 4th largest employer sector in CA boundary |
| WORK_NAICS5_COUNT | Number of employment of the 5th largest employer sector in CA boundary |
| RES_CITY1_TYPE | Top 1 employment locations of CA residents |
| RES_CITY2_TYPE | Top 2 employment locations of CA residents |
| RES_CITY3_TYPE | Top 3 employment locations of CA residents |
| RES_CITY4_TYPE | Top 4 employment locations of CA residents |
| RES_CITY5_TYPE | Top 5 employment locations of CA residents |
| RES_CITY1_COUNT | Number of employment of top 1 employment locations of CA residents |
| RES_CITY2_COUNT | Number of employment of top 2 employment locations of CA residents |
| RES_CITY3_COUNT | Number of employment of top 3 employment locations of CA residents |
| RES_CITY4_COUNT | Number of employment of top 4 employment locations of CA residents |
| RES_CITY5_COUNT | Number of employment of top 5 employment locations of CA residents |
| WORK_CITY1_TYPE | Location with most employed people in the CA |
| WORK_CITY2_TYPE | Location with 2nd most employed people in the CA |
| WORK_CITY3_TYPE | Location with 3rd most employed people in the CA |
| WORK_CITY4_TYPE | Location with 4th most employed people in the CA |
| WORK_CITY5_TYPE | Location with 5th most employed people in the CA |
| WORK_CITY1_COUNT | Number of employment of location with most employed people in the CA |
| WORK_CITY2_COUNT | Number of employment of location with 2nd most employed people in the CA |
| WORK_CITY3_COUNT | Number of employment of location with 3rd most employed people in the CA |
| WORK_CITY4_COUNT | Number of employment of location with 4th most employed people in the CA |
| WORK_CITY5_COUNT | Number of employment of location with 5th most employed people in the CA |
| AVG_VMT | Average Vehicle Miles Traveled per Year of household in CA |
| RET_SALES | No definition found; variable removed during analysis |
| RES_EAV | No definition found; variable removed during analysis |
| CMRCL_EAV | No definition found; variable removed during analysis |
| IND_EAV | No definition found; variable removed during analysis |
| RAIL_EAV | No definition found; variable removed during analysis |
| FARM_EAV | No definition found; variable removed during analysis |
| MIN_EAV | No definition found; variable removed during analysis |
| TOT_EAV | No definition found; variable removed during analysis |
| OPEN_SPACE_PER_1000 | Accessible Park Acreage per 1,000 Residents in the CA |
| TOT_ACRES | Total Acres of land in the CA |
| SF | Total acres of land in the CA for single family residential |
| Sfperc | Percent of total acres of land in the CA for single family residential |
| MF | Total acres of land in the CA for multi-family residential |
| Mfperc | Percent of total acres of land in the CA for multi-family residential |
| MIX | Total acres of land in the CA for mixed use purpose |
| MIXperc | Percent of total acres of land in the CA for mixed use purpose |
| COMM | Total acres of land in the CA for commercial purpose |
| COMMperc | Percent of total acres of land in the CA for commercial purpose |
| INST | Total acres of land in the CA for institutional purpose |
| INSTperc | Percent of total acres of land in the CA for institutional purpose |
| IND | Total acres of land in the CA for industrial purpose |
| INDperc | Percent of total acres of land in the CA for industrial purpose |
| TRANS | Total acres of land in the CA for transportation and Other |
| TRANSperc | Percent of total acres of land in the CA for transportation and Other |
| AG | Total acres of land in the CA for agriculture purpose |
| Agperc | Percent of total acres of land in the CA for agriculture purpose |
| OPEN | Total acres of land in the CA that are open space |
| OPENperc | Percent of total acres of land in the CA that are open space |
| VACANT | Total acres of land in the CA that are vacant |
| VACperc | Percent of total acres of land in the CA that are vacant |
| Year | The year of the dataset |